My Experience on Metagenomics and Workflows

aMeta Workshop

Emrah Kırdök, Ph.D.

7/4/24

Who am I?

  • Emrah Kırdök, Ph.D.
  • Trained as a biologist
  • Working on ancient metagenomics
  • Giving lectures on bioinformatics and data analysis

Outline

  • Ancient metagenomics is still a new discipline
  • We do not have very well defined methods
  • Some struggle is inevitable
  • Lessons learned, good practices, and tricks
  • How aMeta helped me understand Snakemake

Struggles

Bioinformatics methodology

A simple methodology

[Diagram: Input1, Input2, parameters → Tool → Output1, Output2]

Bioinformatics methodology

Actually it is much more complicated…

[Diagram: Input1, Input2, parameters → Tool1 → Output1, Output2; Output2 + more parameters → Tool2 → Output3]

Ancient metagenomics methodology

And in ancient metagenomics, it is much more complicated…

[Diagram: Fastq file + database → Classification → Classification output → Extract aDNA reads and authenticate (against the database) → authentication_1 … authentication_N → Final report]

Ancient metagenomics methodology

It is even more complicated in a real situation:

The aMeta Workflow

Authentication part of the pipeline

It would be very hard to authenticate one by one

Workflow

Workflows

  • Bash scripts are quite easy to write
  • But every time you run one, it starts from the beginning
  • Each bash script tends to be specific to one job
  • Job dependencies must be handled manually (with SLURM?)
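The manual job-dependency chaining can be sketched with SLURM's `--dependency` flag (the script names here are hypothetical):

```bash
# Submit the first job; --parsable prints only the job ID
jid1=$(sbatch --parsable step1_align.sh)

# Run step 2 only if step 1 finishes successfully
jid2=$(sbatch --parsable --dependency=afterok:${jid1} step2_classify.sh)

# The report waits on step 2
sbatch --dependency=afterok:${jid2} step3_report.sh
```

This works, but the bookkeeping grows quickly; workflow managers handle it for you.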

My first workflow

Snakemake

  • aMeta allowed me to learn Snakemake
  • Snakemake changed my view
  • A robust system that can automatically submit SLURM jobs
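A minimal Snakemake sketch (rule names, tools, and paths are hypothetical, not aMeta's actual rules) shows the core idea: each rule declares its inputs and outputs, and Snakemake re-runs only the steps whose outputs are missing or out of date:

```python
# Hypothetical Snakefile sketch -- not aMeta's real rules
rule all:
    input:
        "results/report.txt"

# Classify reads against a database (tool and paths are placeholders)
rule classify:
    input:
        reads="data/raw/sample.fastq",
        db="data/external/db"
    output:
        "results/classification.txt"
    shell:
        "classify_tool --db {input.db} {input.reads} > {output}"

# Summarize the classification into a final report
rule report:
    input:
        "results/classification.txt"
    output:
        "results/report.txt"
    shell:
        "summarize {input} > {output}"
```

Running `snakemake --cores 4` builds only what is needed; with a cluster profile, the same workflow submits SLURM jobs automatically.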

Helps you keep a tidy folder structure

Project name
├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, read-only data.
│
├── docs               <- All documentation should go here
│   ├── reports
│   └── presentations
│
├── workflow           <- Source code and snakemake rules 
│   ├── Snakefile      <- A snakefile, should include all sub rules
│   │
│   ├── rules          <- Separate Snakemake rule files
│   │
│   ├── environments   <- Conda environments
│   │
│   └── singularity    <- Singularity containers
│ 
└── results
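Under this layout, the top-level Snakefile can stitch the sub-rules together. A hypothetical sketch (file names assumed here, not aMeta's actual files):

```python
# Hypothetical top-level Snakefile for this layout
configfile: "config/config.yaml"

# Pull in the separate rule files
include: "rules/classification.smk"
include: "rules/authentication.smk"

# The default target: ask for the final report
rule all:
    input:
        "results/final_report.txt"
```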

Documenting

  • I generally place a README.md file in each project
  • I generally write it in Markdown
  • Documentation is crucial in bioinformatics

Use GitHub

  • Version control
  • Share and collaborate
  • Also helps you back up your work